conservation score, number of sequences, dbSNP accession, and the SIFT prediction of whether
the variant is tolerated or deleterious.

4.4.2  SnpEff

SnpEff [12] is another variant annotation tool. It categorizes the effects of variants
according to their genomic location, such as intron, untranslated region (UTR), upstream,
downstream, or splice site. SnpEff predicts a variety of variant effects, including
synonymous or nonsynonymous substitutions, start-gain, start-loss, stop-gain, and stop-loss
codons, and frameshifts.
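
To make the coding-effect categories above concrete, the following minimal Python sketch
classifies a single codon substitution using the standard genetic code. It is an
illustration only, not SnpEff's implementation: the function name `classify_codon_change`
and the `is_first_codon` flag are hypothetical, stop-gain and stop-loss are reported
separately from other nonsynonymous changes, and start-gain and frameshift effects are left
out because they cannot be decided from a single codon pair.

```python
from itertools import product

# Standard genetic code, laid out in the conventional T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_codon_change(ref_codon, alt_codon, is_first_codon=False):
    """Classify a codon substitution (hypothetical helper, simplified)."""
    ref_codon, alt_codon = ref_codon.upper(), alt_codon.upper()
    ref_aa = CODON_TABLE[ref_codon]
    alt_aa = CODON_TABLE[alt_codon]
    # Losing the ATG at the first codon position removes the start codon.
    if is_first_codon and ref_codon == "ATG" and alt_codon != "ATG":
        return "start-loss"
    if ref_aa == alt_aa:
        return "synonymous"
    if alt_aa == "*":
        return "stop-gain"
    if ref_aa == "*":
        return "stop-loss"
    return "nonsynonymous"


if __name__ == "__main__":
    for ref, alt in [("GAA", "GAG"), ("GAA", "TAA"), ("TAA", "CAA"), ("GAA", "GCA")]:
        print(ref, alt, classify_codon_change(ref, alt))
```

A real annotator also needs the transcript model (strand, reading frame, exon boundaries)
to work out which codon a genomic position falls in; the sketch assumes the reference and
alternative codons are already known.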

In general, SnpEff consists of two main components: (i) database builds and (ii) variant
effect calculation. SnpEff database builds are usually distributed with SnpEff, and hundreds
of databases are available. A database build is a gzip-compressed serialized object built
from the genome FASTA sequence and an annotation file (in GTF or GFF format). These input
files can be obtained from database resources such as ENSEMBL and UCSC. The variant effect
calculation is performed after the database has been built. It begins by building a data
structure, a hash table of interval trees indexed by chromosome, which indexes the annotated
intervals and makes searching them efficient. The
SnpEff program takes a VCF file as input and finds the intersections between its variants
and the annotated database. The intersecting genomic regions are identified, and the effect
on the coding sequence is calculated for exonic regions only. Put simply, SnpEff takes
information from the provided annotation database and annotates the input VCF file by adding
it to the INFO field under the key ANN. The annotation sub-fields are separated by the pipe
character “|”, and their order is given in the VCF header. For example, SnpEff may
categorize a variant as SNP (single-nucleotide polymorphism), Ins (insertion), Del
(deletion), MNP (multiple-nucleotide polymorphism), or MIXED (multiple-nucleotide and
InDel). The impact of a variant is classified as high, moderate, low, or modifier based on
the affected region. A
variant has a high impact when it is disruptive and likely to cause protein truncation,
loss of function, or nonsense-mediated decay. High-impact variants include frameshift and
stop-gain variants. Non-disruptive variants that might only change protein effectiveness,
such as missense SNVs and in-frame deletions, are moderate impact

FIGURE 4.10  SIFT 4G annotation file.
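
The ANN annotation that SnpEff writes, as described in Section 4.4.2, can be unpacked with a
few lines of Python. The sketch below reads the sub-field order from the ANN header line and
splits each annotation on the pipe character. The header line is shortened and the INFO
string is made up for illustration; real SnpEff output lists more sub-fields, and the
function names are assumptions, not part of SnpEff.

```python
import re

# Shortened, illustrative ANN header line (real headers list more sub-fields).
HEADER = (
    '##INFO=<ID=ANN,Number=.,Type=String,Description="Functional annotations: '
    "'Allele | Annotation | Annotation_Impact | Gene_Name | Gene_ID' \">"
)

# Made-up INFO column value carrying two annotations for one variant.
INFO = "DP=35;ANN=A|stop_gained|HIGH|GENE1|ID1,A|upstream_gene_variant|MODIFIER|GENE2|ID2"


def ann_field_names(header_line):
    """Return the ordered ANN sub-field names quoted in the header description."""
    match = re.search(r"'([^']*)'", header_line)
    if match is None:
        raise ValueError("no quoted field list found in ANN header line")
    return [name.strip() for name in match.group(1).split("|")]


def parse_ann(info, field_names):
    """Return one dict per annotation in the ANN entry of a VCF INFO string."""
    for entry in info.split(";"):
        if entry.startswith("ANN="):
            annotations = entry[len("ANN="):].split(",")
            return [
                dict(zip(field_names, (value.strip() for value in ann.split("|"))))
                for ann in annotations
            ]
    return []  # the record carries no ANN entry


if __name__ == "__main__":
    names = ann_field_names(HEADER)
    for ann in parse_ann(INFO, names):
        print(ann["Gene_Name"], ann["Annotation"], ann["Annotation_Impact"])
```

Reading the field order from the header, instead of hard-coding it, keeps the parser aligned
with whatever order the annotated file declares. Filtering the resulting dictionaries on
`Annotation_Impact` is then a simple way to keep only, for example, high-impact annotations.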